1 Data

Two datasets are used for this analysis: “visitor.csv” and “quotes.csv”. “visitor.csv” lists the unique visitors who started a request form in either the house cleaning or the local moving category between August 1 and August 28, 2016. It has 59996 observations and 8 variables. “quotes.csv” lists the quotes that professionals sent in response to customer requests. It has 64330 observations and 5 variables.

2 Analysis

The following R packages were used:
‘plotly’ for interactive plotting, ‘dplyr’ for data manipulation, ‘lubridate’ for date conversion, and ‘knitr’ and ‘kableExtra’ for knitting the report with customized styles.

2.1 Visitors and Conversion Rate

2.1.1

The line chart of daily visitors from August 1st to August 28th shows that the number of visitors peaked on the Monday of each week, decreased over the following days, and reached its weekly minimum on either Saturday or Sunday.
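A weekday pattern like this can be checked by counting visitors per day and then averaging by day of the week. The sketch below runs on a small made-up data frame; the column name `session_date` is an assumption and may not match the column in “visitor.csv”.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for visitor.csv; `session_date` is a hypothetical column name
visitors <- data.frame(
  session_date = as.Date("2016-08-01") + c(0, 0, 0, 1, 1, 5, 6, 7, 7, 7)
)

# Number of visitors on each calendar day, tagged with its weekday
daily <- visitors %>%
  count(session_date, name = "n_visitors") %>%
  mutate(weekday = wday(session_date, label = TRUE, abbr = FALSE))

# Average daily visitors for each weekday
by_weekday <- daily %>%
  group_by(weekday) %>%
  summarise(avg_visitors = mean(n_visitors))

print(by_weekday)
```

Averaging per weekday rather than summing keeps weekdays comparable even when the date range does not contain the same number of each.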

2.1.2

From August 1st to August 28th, 56.9% of people who visited XX company submitted project requests while 43.1% of people did not.

2.1.3

Number of Visitors
         House Cleaning (One Time)  Local Moving (under 50 miles)    Sum
desktop                       8287                          24008  32295
mobile                        9060                          18641  27701
Sum                          17347                          42649  59996

Some facts we can get from the table above:
The total number of visitors was 59996.
There were more visitors on desktops (32295, 54%) than on mobile devices (27701, 46%).
More visitors came to XX company for Local Moving (42649, 71%) than for House Cleaning (17347, 29%).

Number of Requests
         House Cleaning (One Time)  Local Moving (under 50 miles)    Sum
desktop                       5297                          13411  18708
mobile                        5669                           9769  15438
Sum                          10966                          23180  34146

More requests were submitted from desktops (18708) than from mobile devices (15438).
More visitors submitted requests for Local Moving (23180) than for House Cleaning (10966).

From the last two tables, we see that more visitors browsed the website and submitted requests on desktops. Let’s calculate the share of visitors who submitted a request on desktop and on mobile devices, regardless of the service they needed:

request rate on Desktop: \(18708/32295 = 0.579\)
request rate on Mobile: \(15438/27701 = 0.557\)

So, the request rates on the two devices are quite close, with the desktop rate about 2 percentage points higher.

The last two tables also show that more visitors browsed the website and submitted requests for Local Moving than for House Cleaning. Let’s calculate the share of visitors who submitted a request for House Cleaning and for Local Moving, regardless of the device they used:

request rate for House Cleaning: \(10966/17347 = 0.632\)
request rate for Local Moving: \(23180/42649 = 0.544\)

So, although more people visited the website for Local Moving, its request rate is about 9 percentage points lower than that for House Cleaning.

Next, let’s dig deeper and see whether the request rate of each category differs across devices.

From the above tables, it is clear that the request rate for House Cleaning was greater than that for Local Moving on both devices.
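The request rate of each device and category combination can be recomputed from the visitor and request counts reported in the two count tables of this section. The sketch below copies those counts into matrices; the object names are illustrative only.

```r
# Visitor and request counts copied from the two count tables in this section
visitors <- matrix(c(8287, 24008,
                     9060, 18641),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(device = c("desktop", "mobile"),
                                   category = c("House Cleaning", "Local Moving")))
requests <- matrix(c(5297, 13411,
                     5669,  9769),
                   nrow = 2, byrow = TRUE,
                   dimnames = dimnames(visitors))

# Request rate for each device/category cell
rate_cell <- round(requests / visitors, 3)

# Marginal request rates by device and by category
rate_device   <- round(rowSums(requests) / rowSums(visitors), 3)
rate_category <- round(colSums(requests) / colSums(visitors), 3)

print(rate_cell)
print(rate_device)
print(rate_category)
```

With these counts the House Cleaning rate comes out at roughly 0.64 on desktop and 0.63 on mobile, against roughly 0.56 and 0.52 for Local Moving.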

2.2 Quotes Per Request

2.2.1

Data cleaning is involved in this part of the analysis: request_id 7432 has 6 quotes, which should be an error since the number of quotes for each request should be between 0 and 5. Therefore, I removed that request and plotted the distribution of the number of quotes per request from the rest of the data, as shown below.

request_id num_quotes
7432 6
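A check of this kind can be sketched as follows: count the quotes per request, flag any request above the expected maximum of 5, and keep the rest. The data frame is made up for illustration, and `quote_id` is a hypothetical column name; only `request_id` is taken from the report.

```r
library(dplyr)

# Toy stand-in for quotes.csv; `quote_id` is a hypothetical column
quotes <- data.frame(
  request_id = c(1, 1, 2, 3, 3, 3, rep(7432, 6)),
  quote_id   = 1:12
)

# Number of quotes received by each request
quotes_per_request <- quotes %>%
  count(request_id, name = "num_quotes")

# Requests above the expected maximum of 5 quotes
invalid <- quotes_per_request %>% filter(num_quotes > 5)

# Requests kept for the distribution plot
clean <- quotes_per_request %>% filter(num_quotes <= 5)

print(invalid)
print(clean)
```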

There are a total of 26858 requests and 64229 quotes. From the bar chart we can see that most requests got 1~3 quotes, with 2 quotes being the most common. The distribution of the number of quotes per request is right-skewed, meaning that most requests received no more than 3 quotes.

A pie chart gives a clearer view of the proportion of requests receiving each number of quotes. About 81% (28.4% + 29.7% + 23%) of the requests got 1~3 quotes. Requests with 2 quotes made up about 30%, while 11.7% of requests got 4 quotes and only 7% got 5 quotes.

2.2.2

## 
## Call:
## lm(formula = num_quotes ~ how_far + num_bedrooms + num_bathrooms, 
##     data = req)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8853 -1.2107 -0.2708  0.7292  3.0059 
## 
## Coefficients: (1 not defined because of singularities)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.28497    0.08581  26.628  < 2e-16 ***
## how_far11 - 20 miles             0.27517    0.08791   3.130  0.00175 ** 
## how_far21 - 30 miles             0.57525    0.08978   6.407 1.51e-10 ***
## how_far31 - 50 miles             0.60035    0.09060   6.626 3.50e-11 ***
## how_far5 - 10 miles             -0.01420    0.08749  -0.162  0.87106    
## how_farLess than 5 miles         0.01166    0.08719   0.134  0.89358    
## how_farWithin the same building -0.23944    0.09976  -2.400  0.01640 *  
## num_bedrooms1 bedroom           -0.02475    0.12168  -0.203  0.83884    
## num_bedrooms2 bedrooms           0.05330    0.11453   0.465  0.64165    
## num_bedrooms3 bedrooms           0.10285    0.11130   0.924  0.35545    
## num_bedrooms4 bedrooms           0.11492    0.10867   1.058  0.29028    
## num_bedrooms5+ bedrooms          0.11740    0.11173   1.051  0.29340    
## num_bedroomsStudio              -0.12060    0.15134  -0.797  0.42554    
## num_bathrooms1 bathroom         -0.17029    0.07939  -2.145  0.03196 *  
## num_bathrooms1.5 bathrooms      -0.17707    0.08263  -2.143  0.03213 *  
## num_bathrooms2 bathrooms        -0.06920    0.07099  -0.975  0.32968    
## num_bathrooms3 bathrooms        -0.06602    0.06957  -0.949  0.34264    
## num_bathrooms4+ bathrooms             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.192 on 26841 degrees of freedom
## Multiple R-squared:  0.03428,    Adjusted R-squared:  0.03371 
## F-statistic: 59.56 on 16 and 26841 DF,  p-value: < 2.2e-16

To approach this problem, I merged the two datasets by their common column ‘request_id’ and removed the observation whose number of quotes equals 6, as mentioned in the last problem. I then fitted a linear regression model on how_far, num_bedrooms and num_bathrooms, treating all three as categorical variables.
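The merge-and-fit step can be sketched as below. The data are simulated, so the coefficients carry no meaning, and setting an explicit reference level with `relevel()` is an addition for readability rather than something the report's own code is known to do.

```r
set.seed(1)

# Simulated stand-ins for the request attributes and the quote counts
visitors <- data.frame(
  request_id    = 1:200,
  how_far       = sample(c("Less than 5 miles", "5 - 10 miles", "11 - 20 miles"),
                         200, replace = TRUE),
  num_bedrooms  = sample(c("1 bedroom", "2 bedrooms"), 200, replace = TRUE),
  num_bathrooms = sample(c("1 bathroom", "2 bathrooms"), 200, replace = TRUE)
)
quotes_per_request <- data.frame(
  request_id = 1:200,
  num_quotes = sample(1:5, 200, replace = TRUE)
)

# Join on the shared key, as described in the text
req <- merge(visitors, quotes_per_request, by = "request_id")

# Treat predictors as categorical; choose the baseline level explicitly
req$how_far       <- relevel(factor(req$how_far), ref = "Less than 5 miles")
req$num_bedrooms  <- factor(req$num_bedrooms)
req$num_bathrooms <- factor(req$num_bathrooms)

fit <- lm(num_quotes ~ how_far + num_bedrooms + num_bathrooms, data = req)
print(summary(fit))
```

Each coefficient of a categorical predictor is a difference from that predictor's reference level, so fixing the reference level makes the signs in the summary table easier to read.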

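From the above table, we can see that distance and number of bathrooms are the main factors which contribute to some requests getting more quotes than others. Relative to the reference level, requests for “11 - 20 miles”, “21 - 30 miles” and “31 - 50 miles” got more quotes (positive, significant coefficients), requests “within the same building” got fewer quotes (coefficient -0.24), and requests of 10 miles or less showed no significant difference. Requests for 1 or 1.5 bathrooms got fewer quotes than the reference level (coefficients of about -0.17), while requests for 2 or 3 bathrooms showed no significant difference.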

Note that this result is based on p-values (with 0.05 as the significance level) computed with all three variables in the model: distance and number of bathrooms are statistically significant given that the other variables are included. So, it does not necessarily mean that distance or number of bathrooms affects the number of quotes on its own. It also does not mean that the number of bedrooms has no effect on the number of quotes. The model also explains little of the variation in the number of quotes (adjusted R-squared of 0.034).

With that in mind, Chi-squared tests of association were then conducted to test whether each variable on its own is associated with the number of quotes for a request. Before looking at the results, we need to be clear about the hypotheses of the tests and check whether the assumptions of the Chi-squared test are satisfied.

Hypothesis:
Null hypothesis (H0): the two categorical variables of a contingency table are independent.
Alternative hypothesis (H1): the two categorical variables of a contingency table are dependent.

In our case, “the two categorical variables” means “Number of Quotes and Distance”, “Number of Quotes and Number of Bedrooms”, or “Number of Quotes and Number of Bathrooms”.

Assumptions:
1. The levels (or categories) of the variables are mutually exclusive.
2. The expected cell count should be 5 or more in at least 80% of the cells, and no cell should have an expected count of less than one.
3. Each subject may contribute data to one and only one cell of the contingency table.
4. The study groups must be independent.
5. There are 2 variables, and both are measured as categories, usually at the nominal level.
These assumptions are taken from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900058/.

After checking, our data satisfied the assumptions of the Chi-squared test (details can be found in the code). Therefore, Chi-squared tests were conducted for Number of Quotes vs. Distance, Number of Quotes vs. Number of Bedrooms, and Number of Quotes vs. Number of Bathrooms, as below.

## 
##  Pearson's Chi-squared test
## 
## data:  req$num_quotes and req$how_far
## X-squared = 905.81, df = 24, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  req$num_quotes and req$num_bedrooms
## X-squared = 186.03, df = 24, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  req$num_quotes and req$num_bathrooms
## X-squared = 180.93, df = 20, p-value < 2.2e-16

The three Chi-squared tests show that each of Distance, Number of Bedrooms and Number of Bathrooms is NOT independent of the Number of Quotes. We can therefore conclude that Distance and Number of Bathrooms are the more significant factors when all three are considered together, while each variable on its own is associated with the Number of Quotes.
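The expected-count assumption listed earlier can be checked from the `expected` component that `chisq.test()` returns. The sketch below runs on simulated data, so its test result is not meaningful; it only illustrates the check and is not the report's own code.

```r
set.seed(42)

# Simulated stand-in for the merged request data
req <- data.frame(
  num_quotes = sample(1:5, 500, replace = TRUE),
  how_far    = sample(c("Less than 5 miles", "5 - 10 miles", "11 - 20 miles"),
                      500, replace = TRUE)
)

# Contingency table of quote counts against distance
tab  <- table(req$num_quotes, req$how_far)
test <- chisq.test(tab)

# Expected counts under independence
expected     <- test$expected
share_ge5    <- mean(expected >= 5)  # share of cells with expected count >= 5
min_expected <- min(expected)        # smallest expected count

# At least 80% of cells >= 5 and no cell below 1
assumption_ok <- share_ge5 >= 0.8 && min_expected >= 1

print(test)
print(assumption_ok)
```

The degrees of freedom equal (rows - 1) x (columns - 1), which is how a 5-row table with 7 distance levels gives df = 24 in the output above.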

2.3 Job Value

2.3.1

library(dplyr)
library(plotly)

# Keep quotes with a known price (quote_visitors is defined earlier in the analysis)
cat_dist <- quote_visitors[!is.na(quote_visitors$quote_price), c("category_name", "quote_price")]

# Quantiles (0%, 25%, 50%, 75%, 100%) of quote prices in each category
qt1 <- quantile(cat_dist[which(cat_dist$category_name == "House Cleaning (One Time)"), ]$quote_price)
qt2 <- quantile(cat_dist[which(cat_dist$category_name != "House Cleaning (One Time)"), ]$quote_price)

qt1 <- data.frame(category = "House Cleaning (One Time)", quantiles = names(qt1), prices = qt1)
qt2 <- data.frame(category = "Local Moving (under 50 miles)", quantiles = names(qt2), prices = qt2)

# Stack both categories and drop the minimum (0%) and maximum (100%)
category_dist <- rbind(qt1, qt2)
category_dist <- category_dist[!category_dist$quantiles %in% c("0%", "100%"), ]

# Bar charts of the 25%, 50% and 75% quantiles for each category
category_dist %>% filter(category == "House Cleaning (One Time)") %>%
  plot_ly(x = ~quantiles, y = ~prices, type = "bar") %>%
  layout(title = "Distribution of House Cleaning Quotes", yaxis = list(title = "Prices"), xaxis = list(title = "Quantiles"))
# The bar colour is set through `marker`; a bare string in `color` is treated as data, not as a colour
category_dist %>% filter(category != "House Cleaning (One Time)") %>%
  plot_ly(x = ~quantiles, y = ~prices, type = "bar", marker = list(color = "orange")) %>%
  layout(title = "Distribution of Local Moving Quotes", yaxis = list(title = "Prices"), xaxis = list(title = "Quantiles"))


The plots above are bar charts of the quartiles of quote prices in each category. For House Cleaning services, 25% of quotes were priced at 99 dollars or less, 50% at 140 dollars or less, and 25% at 185 dollars or more.

For Local Moving services, 25% of quotes were priced at 250 dollars or less, 50% at 351 dollars or less, and 25% at 400 dollars or more.

2.3.2


Many of the quotes from pros who got hired have NA prices, which points to an anomaly in the data. So, I thought it would be better to compute the average quote price and the average earning of each professional, to explore how much XX company should charge professionals per quote so that each professional still benefits from sending quotes. Explanations of certain variables in “pros” are given below:

total_earning = sum(hire_price). hire_price is defined as quote_price when the professional got hired; otherwise it is 0.

avg_earn = total_earning/times_quoted

avg_price_quote = mean(quote_price)
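These per-professional variables could be built with a grouped summary along the following lines. The data frame is made up, `pro_id` and `hired` are hypothetical column names, and counting an NA price for a hired quote as 0 and dropping NAs from the mean are assumptions rather than the report's stated choices.

```r
library(dplyr)

# Toy quote-level data; `pro_id` and `hired` are hypothetical column names
quotes <- data.frame(
  pro_id      = c("a", "a", "a", "b", "b", "c"),
  quote_price = c(100, 120, NA, 200, 220, 90),
  hired       = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

pros <- quotes %>%
  # hire_price is quote_price when hired, otherwise 0 (NA prices counted as 0: assumption)
  mutate(hire_price = ifelse(hired & !is.na(quote_price), quote_price, 0)) %>%
  group_by(pro_id) %>%
  summarise(
    times_quoted    = n(),
    total_earning   = sum(hire_price),
    avg_earn        = total_earning / times_quoted,
    avg_price_quote = mean(quote_price, na.rm = TRUE),  # NA prices dropped: assumption
    hired_once      = any(hired)
  )

# Share of professionals hired at least once
share_hired <- mean(pros$hired_once)

print(pros)
print(share_hired)
```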

Of all professionals, 26.16% were hired at least once; the others were not hired at all.

58.37% of the professionals who put up a quote more than once were hired at least once.

We can see that professionals have a high drop-off rate when they are not hired the first time. So, a good way to retain new pros is not to charge them for their first few quotes until they are hired. But we should also not allow them to quote any price. Instead, we can limit their quote prices to a range derived from a weighted average of the hourly rates in the location and the rates at which professionals are hired for similar tasks in the locality.